Methods and apparatus to generate computer-trained machine learning models to correct computer-generated errors in audience data

ABSTRACT

Methods, apparatus, systems and articles of manufacture are disclosed to generate computer-trained machine learning models to correct computer-generated errors in audience data. An example apparatus includes a query selector to select a plurality of features and a range of hyperparameters; a query generator to generate a plurality of machine learning models based on the plurality of features and the range of hyperparameters, and initiate training of the plurality of machine learning models based on demographic data in a privacy-protected cloud environment, the demographic data obtained from database proprietor user accounts corresponding to audience measurement panelists; and a model selector to select a first machine learning model from the plurality of machine learning models.

RELATED APPLICATION(S)

This patent arises from a non-provisional patent application that claims the benefit of U.S. Provisional Patent Application No. 63/024,260, which was filed on May 13, 2020. U.S. Provisional Patent Application No. 63/024,260 is hereby incorporated herein by reference in its entirety. Priority to U.S. Provisional Patent Application No. 63/024,260 is hereby claimed.

FIELD OF THE DISCLOSURE

This disclosure relates generally to monitoring audiences, and, more particularly, to methods and apparatus to generate computer-trained machine learning models to correct computer-generated errors in audience data.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an example system to enable the generation of audience measurement metrics based on the merging of data collected by a database proprietor and an audience measurement entity (AME).

FIG. 2 is an example block diagram of the example model generator of FIG. 1.

FIG. 3 is an example block diagram of the example model analyzer of FIG. 1.

FIG. 4 is a flowchart representative of example machine readable instructions which may be executed to implement the example model generator of FIGS. 1 and/or 2 to generate computer-generated machine learning models and associated performance results.

FIG. 5 is a flowchart representative of example machine readable instructions which may be executed to implement the example model analyzer of FIGS. 1 and/or 3 to aggregate the performance results and select one or more of the computer-generated machine learning models to use in correcting computer-generated errors in audience data.

FIG. 6 is a block diagram of an example processing platform structured to execute the instructions of FIGS. 4 and/or 5 to implement the example model generator of FIGS. 1 and/or 2 and the example model analyzer of FIGS. 1 and/or 3 to generate a plurality of computer-generated machine learning models and select one or more of the computer-generated machine learning models based on performance data to correct computer-generated errors in audience data.

The figures are not to scale. In general, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts.

Unless specifically stated otherwise, descriptors such as “first,” “second,” “third,” etc. are used herein without imputing or otherwise indicating any meaning of priority, physical order, arrangement in a list, and/or ordering in any way, but are merely used as labels and/or arbitrary names to distinguish elements for ease of understanding the disclosed examples. In some examples, the descriptor “first” may be used to refer to an element in the detailed description, while the same element may be referred to in a claim with a different descriptor such as “second” or “third.” In such instances, it should be understood that such descriptors are used merely for identifying those elements distinctly that might, for example, otherwise share a same name. As used herein, “approximately” and “about” refer to dimensions that may not be exact due to manufacturing tolerances and/or other real world imperfections. As used herein, “substantially real time” refers to occurrence in a near instantaneous manner recognizing there may be real world delays for computing time, transmission, etc. Thus, unless otherwise specified, “substantially real time” refers to real time +/−1 second.

DETAILED DESCRIPTION

Audience measurement entities (AMEs) usually collect large amounts of audience measurement information from their panelists including the number of unique audience members for particular media and the number of impressions corresponding to each of the audience members. Unique audience size, as used herein, refers to the total number of unique people (e.g., non-duplicate people) who had an impression of (e.g., were exposed to) a particular media item, without counting duplicate audience members. As used herein, an impression is defined to be an event in which a home or individual accesses and/or is exposed to media (e.g., an advertisement, content, a group of advertisements and/or a collection of content). Impression count, as used herein, refers to the number of times audience members are exposed to a particular media item. The unique audience size associated with a particular media item will always be equal to or less than the number of impressions associated with the media item because, while all audience members by definition have at least one impression of the media, an individual audience member may have more than one impression. That is, the unique audience size is equal to the impression count only when every audience member was exposed to the media only a single time (i.e., the number of audience members equals the number of impressions). Where at least one audience member is exposed to the media multiple times, the unique audience size will be less than the total impression count because multiple impressions will be associated with individual audience members. Thus, unique audience size refers to the number of unique people in an audience (without double counting any person) exposed to media for which audience metrics are being generated. Unique audience size may also be referred to as unique audience, deduplicated audience size, deduplicated audience, or audience.

Techniques for monitoring user access to an Internet-accessible media, such as digital television (DTV) media and digital content ratings (DCR) media, have evolved significantly over the years. Internet-accessible media is also known as digital media. In the past, such monitoring was done primarily through server logs. In particular, media providers serving media on the Internet would log the number of requests received for their media at their servers. Basing Internet usage research on server logs is problematic for several reasons. For example, server logs can be tampered with either directly or via zombie programs, which repeatedly request media from the server to increase the server log counts. Also, media is sometimes retrieved once, cached locally and then repeatedly accessed from the local cache without involving the server. Server logs cannot track such repeat views of cached media. Thus, server logs are susceptible to both over-counting and under-counting errors.

As Internet technology advanced, the limitations of server logs were overcome through methodologies in which the Internet media to be tracked was tagged with monitoring instructions. In particular, monitoring instructions (also known as a media impression request or a beacon request) are associated with the hypertext markup language (HTML) of the media to be tracked. When a client requests the media, both the media and the impression request are downloaded to the client. The impression requests are, thus, executed whenever the media is accessed, be it from a server or from a cache.

The beacon instructions cause monitoring data reflecting information about the access to the media (e.g., the occurrence of a media impression) to be sent from the client that downloaded the media to a monitoring server. Typically, the monitoring server is owned and/or operated by an AME (e.g., any party interested in measuring or tracking audience exposures to advertisements, media, and/or any other media) that did not provide the media to the client and who is a trusted third party for providing accurate usage statistics (e.g., The Nielsen Company, LLC). Advantageously, because the beaconing instructions are associated with the media and executed by the client browser whenever the media is accessed, the monitoring information is provided to the AME irrespective of whether the client is associated with a panelist of the AME. In this manner, the AME is able to track every time a person is exposed to the media on a census-wide or population-wide level. As a result, the AME can reliably determine the total impression count for the media without having to extrapolate from panel data collected from a relatively limited pool of panelists within the population. Frequently, such beacon requests are implemented in connection with third-party cookies. Since the AME is a third party relative to the first party serving the media to the client device, the cookie sent to the AME in the impression request to report the occurrence of the media impression of the client device is a third-party cookie. Third-party cookie tracking is used by audience measurement servers to track access to media by client devices from first-party media servers.

Tracking impressions by tagging media with beacon instructions using third-party cookies is insufficient, by itself, to enable an AME to reliably determine the unique audience size associated with the media if the AME cannot identify the individual user associated with the third-party cookie. That is, the unique audience size cannot be determined because the collected monitoring information does not uniquely identify the person(s) exposed to the media. Under such circumstances, the AME cannot determine whether two reported impressions are associated with the same person or two separate people. The AME may set a third-party cookie on a client device reporting the monitoring information to identify when multiple impressions occur using the same device. However, cookie information does not indicate whether the same person used the client device in connection with each media impression. Furthermore, the same person may access media using multiple different devices that have different cookies so that the AME cannot directly determine when two separate impressions are associated with the same person or two different people.

Furthermore, the monitoring information reported by a client device executing the beacon instructions does not provide an indication of the demographics or other user information associated with the person(s) exposed to the associated media. To at least partially address this issue, the AME establishes a panel of users who have agreed to provide their demographic information and to have their Internet browsing activities monitored. When an individual joins the panel, that person provides corresponding detailed information concerning the person's identity and demographics (e.g., gender, race, income, home location, occupation, etc.) to the AME. The AME sets a cookie on the panelist computer that enables the AME to identify the panelist whenever the panelist accesses tagged media and, thus, sends monitoring information to the AME. Additionally or alternatively, the AME may identify the panelists using other techniques (independent of cookies) by, for example, prompting the user to login or identify themselves. While AMEs are able to obtain user-level information for impressions from panelists (e.g., identify unique individuals associated with particular media impressions), most of the client devices providing monitoring information from the tagged pages are not panelists. Thus, the identity of most people accessing media remains unknown to the AME such that it is necessary for the AME to use statistical methods to impute demographic information based on the data collected for panelists to the larger population of users providing data for the tagged media. However, panel sizes of AMEs remain small compared to the general population of users.

There are many database proprietors operating on the Internet. These database proprietors provide services to large numbers of subscribers. In exchange for the provision of services, the subscribers register with the database proprietors. Examples of such database proprietors include social network sites (e.g., Facebook, Twitter, My Space, etc.), multi-service sites (e.g., Yahoo!, Google, Axiom, Catalina, etc.), online retailer sites (e.g., Amazon.com, Buy.com, etc.), credit reporting sites (e.g., Experian), streaming media sites (e.g., YouTube, Hulu, etc.), etc. These database proprietors set cookies and/or other device/user identifiers on the client devices of their subscribers to enable the database proprietors to recognize their subscribers when their subscribers visit website(s) on the Internet domains of the database proprietors.

The protocols of the Internet make cookies inaccessible outside of the domain (e.g., Internet domain, domain name, etc.) on which they were set. Thus, a cookie set in, for example, the YouTube.com domain (e.g., a first party) is accessible to servers in the YouTube.com domain, but not to servers outside that domain. Therefore, although an AME (e.g., a third party) might find it advantageous to access the cookies set by the database proprietors, they are unable to do so. However, techniques have been developed that enable an AME to leverage media impression information collected in association with demographic information in subscriber databases of database proprietors to collect more extensive Internet usage data (e.g., beyond the limited pool of individuals participating in an AME panel) by extending the impression request process to encompass partnered database proprietors and by using such partners as interim data collectors. In particular, this task is accomplished by structuring the AME to respond to impression requests from clients (who may not be a member of an audience measurement panel and, thus, may be unknown to the AME) by redirecting the clients from the AME to a database proprietor, such as a social network site partnered with the AME, using an impression response. Such a redirection initiates a communication session between the client accessing the tagged media and the database proprietor. For example, the impression response received from the AME may cause the client to send a second impression request to the database proprietor along with a cookie set by that database proprietor. In response to receiving this impression request, the database proprietor (e.g., Facebook) can access the cookie it has set on the client to thereby identify the client based on the internal records of the database proprietor.
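For illustration only, the following minimal sketch shows how the redirect step described above could be expressed with Python's standard library. The database proprietor URL, the endpoint path, and the idea that the impression identifier travels as a query string are assumptions for this example, not the actual interfaces of any AME or database proprietor.

    import urllib.parse
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical beacon endpoint of the partnered database proprietor (placeholder URL).
    DB_PROPRIETOR_BEACON = "https://dbproprietor.example/impression"

    class ImpressionRequestHandler(BaseHTTPRequestHandler):
        """Answers a beacon (impression) request with a redirect to the database proprietor."""

        def do_GET(self):
            # The client reports the impression to the AME; any query parameters
            # (e.g., a media/campaign identifier) are carried through the redirect.
            query = urllib.parse.urlparse(self.path).query
            redirect_url = f"{DB_PROPRIETOR_BEACON}?{query}" if query else DB_PROPRIETOR_BEACON
            # An HTTP 302 response causes the client to issue a second impression
            # request to the database proprietor, which can then read its own
            # first-party cookie to identify the client.
            self.send_response(302)
            self.send_header("Location", redirect_url)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), ImpressionRequestHandler).serve_forever()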

In the event the client corresponds to a subscriber of the database proprietor (as determined from the cookie associated with the client), the database proprietor logs/records a database proprietor demographic impression in association with the client/user. As used herein, a demographic impression is an impression that can be matched to particular demographic information of a particular subscriber or registered user of the services of a database proprietor. The database proprietor has the demographic information for the particular subscriber because the subscriber would have provided such information when setting up an account to subscribe to the services of the database proprietor.

Sharing of demographic information associated with subscribers of database proprietors enables AMEs to extend or supplement their panel data with substantially reliable demographics information from external sources (e.g., database proprietors), thus extending the coverage, accuracy, and/or completeness of their demographics-based audience measurements. Such access also enables the AME to monitor persons who would not otherwise have joined an AME panel. Any web service provider having a database identifying demographics of a set of individuals may cooperate with the AME. Such web service providers may be referred to as “database proprietors” and include, for example, wireless service carriers, mobile software/service providers, social media sites (e.g., Facebook, Twitter, MySpace, etc.), online retailer sites (e.g., Amazon.com, Buy.com, etc.), multi-service sites (e.g., Yahoo!, Google, Experian, etc.), and/or any other Internet sites that collect demographic data of users and/or otherwise maintain user registration records. The use of demographic information from disparate data sources (e.g., high-quality demographic information from the panels of an audience measurement entity and/or registered user data of database proprietors) results in improved reporting effectiveness of metrics for both online and offline advertising campaigns.

The above approach to generating audience metrics by an AME depends upon the beacon requests (or tags) associated with the media to be monitored to enable an AME to obtain census wide impression counts (e.g., impressions that include the entire population exposed to the media regardless of whether the audience members are panelists of the AME). Further, the above approach also depends on third-party cookies to enable the enrichment of the census impressions with demographic information from database proprietors. However, in more recent years, there has been a movement away from the use of third-party cookies by third parties. Thus, while media providers (e.g., database proprietors) may still use first-party cookies to collect first-party data, the elimination of third-party cookies prevents the tracking of Internet media by AMEs (outside of client devices associated with panelists for which the AME has provided a meter to track Internet usage behavior). Furthermore, independent of the use of cookies, some database proprietors are moving towards the elimination of third party impression requests or tags (e.g., redirect instructions) embedded in media (e.g., beginning in 2020, third-party tags will no longer be allowed on Youtube.com and other Google Video Partner (GVP) sites). As technology moves in this direction, AMEs (e.g., third parties) will no longer be able to track census wide impressions of media in the manner they have in the past. Furthermore, AMEs will no longer be able to send a redirect request to a client accessing media to cause a second impression request to a database proprietor to associate the impression with demographic information. Thus, the only Internet media monitoring that AMEs will be able to directly perform in such a system will be with panelists that have agreed to be monitored using different techniques that do not depend on third-party cookies and/or tags.

Examples disclosed herein overcome at least some of the limitations that arise out of the elimination of third-party cookies and/or third-party tags by enabling the merging of high-quality demographic information from the panels of an AME with media impression data that continues to be collected by database proprietors. As mentioned above, while third-party cookies and/or third-party tags may be eliminated, database proprietors that provide and/or manage the delivery of media accessed online are still able to track impressions of the media (e.g., via first-party cookies and/or first-party tags). Furthermore, database proprietors are still able to associate demographic information with the impressions whenever the impressions can be matched to a particular subscriber of the database proprietor for which demographic information has been collected (e.g., when the user registered with the database proprietor). In some examples, the AME panel data and the database proprietor impressions data are merged in a privacy-protected cloud environment maintained by the database proprietor.

Examples disclosed herein generate computer-trained machine learning models to correct computer-generated errors in audience data, such as misattribution errors and/or non-coverage errors. Misattribution error refers to the measurement bias (e.g., generated by a computer) that occurs when a first person belonging to a first demographic group is believed to be the person associated with a media impression on a device when, in fact, a second person belonging to a second demographic group (e.g., a second demographic group different from the first demographic group) is the person for whom the media impression occurred. As used herein, non-coverage error refers to the measurement bias (e.g., generated by a computer) that occurs due to the inability of the database proprietor to recognize (e.g., identify the demographics of) a portion of the audience using network-connected devices (e.g., Internet-connected devices, mobile devices, smartphones, tablet devices, computers, Internet televisions, etc.) to view media. In examples disclosed herein, the privacy-protected cloud environment includes the capability to run computer-generated machine learning models to correct for the computer-generated errors in the audience data. However, in prior cloud environments, there is no ability to compare results between different computer-generated machine learning models to determine the best performing variant of the computer-generated machine learning models.

Examples disclosed herein generate computer-trained machine learning models in a privacy-protected cloud environment and determine performance results from the variants of the computer-trained machine learning models. Examples disclosed herein use covariates associated with user data from a database proprietor (e.g., streaming browsing category data, search browsing category data, hours of the day the user is active, etc.) to generate the computer-trained machine learning models to correct for the computer-generated errors in audience data. Examples disclosed herein generate a plurality of machine learning models with varying combinations of features and ranges for hyperparameters. Examples disclosed herein run the different combinations of computer-generated machine learning models in parallel using the user data from the database proprietor. Examples disclosed herein determine performance results for the different computer-generated machine learning models (e.g., model accuracy, demographic data accuracy, etc.) based on a comparison of the results of the computer-generated machine learning models and the audience data associated with audience measurement panelists from the AME. Examples disclosed herein aggregate the performance results for the computer-generated machine learning models and select one or more of the computer-generated machine learning models based on the performance results to use in correcting the computer-generated errors in audience data.
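By way of illustration only, the following sketch shows one way such a sweep over feature subsets and hyperparameter ranges could be expressed, assuming scikit-learn style estimators and in-memory panel truth labels. The feature names, the hyperparameter grid, and the accuracy scoring are hypothetical placeholders rather than the specific configuration used in the examples described herein.

    from itertools import product

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hypothetical feature subsets and hyperparameter ranges (placeholders).
    FEATURE_SETS = [
        ["self_declared_age", "streaming_categories"],
        ["self_declared_age", "streaming_categories", "search_categories", "active_hours"],
    ]
    HYPERPARAMETERS = {"n_estimators": [100, 200], "max_depth": [3, 5]}

    def train_model_variants(covariates, panel_truth_labels):
        """Train one model per (feature set, hyperparameter) combination and score
        each variant against the AME panel truth labels."""
        results = []
        for features, n_est, depth in product(
            FEATURE_SETS, HYPERPARAMETERS["n_estimators"], HYPERPARAMETERS["max_depth"]
        ):
            X = covariates[features]
            X_train, X_test, y_train, y_test = train_test_split(
                X, panel_truth_labels, test_size=0.2, random_state=0
            )
            model = GradientBoostingClassifier(n_estimators=n_est, max_depth=depth)
            model.fit(X_train, y_train)
            # Performance result for this variant (e.g., model accuracy).
            score = accuracy_score(y_test, model.predict(X_test))
            results.append({"features": features, "n_estimators": n_est,
                            "max_depth": depth, "accuracy": score, "model": model})
        return results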

More particularly, FIG. 1 is a block diagram illustrating an example system 100 to enable the generation of audience measurement metrics based on the merging of data collected by a database proprietor 102 and an AME 104. More particularly, in some examples, the data includes AME panel data (that includes media impressions for panelists that are associated with high-quality demographic information collected by the AME 104) and database proprietor impressions data (which may be enriched with demographic and/or other information available to the database proprietor 102). In the illustrated example, these disparate sources of data are combined within a privacy-protected cloud environment 106 managed and/or maintained by the database proprietor 102. The privacy-protected cloud environment 106 is a cloud-based environment that enables media providers (e.g., advertisers and/or content providers) and third parties (e.g., the AME 104) to input and combine their data with data from the database proprietor 102 inside a data warehouse or data store that enables efficient big data analysis. The combining of data from different parties (e.g., different Internet domains) presents risks to the privacy of the data associated with individuals represented by the data from the different parties. Accordingly, the privacy-protected cloud environment 106 is established with privacy constraints that prevent any associated party (including the database proprietor 102) from accessing private information associated with particular individuals. Rather, any data extracted from the privacy-protected cloud environment 106 following a big data analysis and/or query is limited to aggregated information. A specific example that may be used to implement the privacy-protected cloud environment 106 is the Ads Data Hub (ADH) developed by Google LLC of Mountain View, Calif., U.S.A.

As used herein, a media impression is defined as an occurrence of access and/or exposure to media 108 (e.g., an advertisement, a movie, a movie trailer, a song, a web page banner, etc.). Examples disclosed herein may be used to monitor for media impressions of any one or more media types (e.g., video, audio, a web page, an image, text, etc.). In examples disclosed herein, the media 108 may be primary content and/or advertisements. Examples disclosed herein are not restricted for use with any particular type of media. On the contrary, examples disclosed herein may be implemented in connection with tracking impressions for media of any type or form in a network.

In the illustrated example of FIG. 1, content providers and/or advertisers distribute the media 108 via the Internet to users that access websites and/or online television services (e.g., web-based TV, Internet protocol TV (IPTV), etc.). For purposes of explanation, examples disclosed herein are described assuming the media 108 is an advertisement that may be provided in connection with particular content of primary interest to a user. In some examples, the media 108 is served by media servers managed by and/or associated with the database proprietor 102 that manages and/or maintains the privacy-protected cloud environment 106. For example, the database proprietor 102 may be Google, and the media 108 corresponds to ads served with videos accessed via Youtube.com and/or via other Google video partners (GVPs). More generally, in some examples, the database proprietor 102 includes corresponding database proprietor servers that can serve media 108 to individuals via client devices 110. In the illustrated example of FIG. 1, the client devices 110 may be stationary or portable computers, handheld computing devices, smart phones, Internet appliances, smart televisions, and/or any other type of device that may be connected to the Internet and capable of presenting media. For purposes of explanation, the client devices 110 of FIG. 1 include panelist client devices 112 and non-panelist client devices 114 to indicate that at least some individuals that access and/or are exposed to the media 108 correspond to panelists who have provided detailed demographic information to the AME 104 and have agreed to enable the AME 104 to track their exposure to the media 108. In many situations, other individuals who are not panelists will also be exposed to the media 108 (e.g., via the non-panelist client devices 114). Typically, the number of non-panelist audience members for a particular media item will be significantly greater than the number of panelist audience members. In some examples, the panelist client devices 112 may include and/or implement an audience measurement meter 115 that captures the impressions of media 108 accessed by the panelist client devices 112 (along with associated information) and reports the same to the AME 104. In some examples, the audience measurement meter 115 may be a separate device from the panelist client device 112 used to access the media 108.

In some examples, the media 108 is associated with a unique impression identifier (e.g., a consumer playback nonce (CPN)) generated by the database proprietor 102. In some examples, the impression identifier serves to uniquely identify a particular impression of the media 108. Thus, even though the same media 108 may be served multiple times, each time the media 108 is served the database proprietor 102 will generate a new and different impression identifier so that each impression of the media 108 can be distinguished from every other impression of the media. In some examples, the impression identifier is encoded into a uniform resource locator (URL) used to access the primary content (e.g., a particular YouTube video) along with which the media 108 (as an advertisement) is served. In some examples, with the impression identifier (e.g., CPN) encoded into the URL associated with the media 108, the audience measurement meter 115 extracts the identifier at the time that a media impression occurs so that the AME 104 is able to associate a captured impression with the impression identifier.
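As a simple illustration of this extraction step, the snippet below pulls an impression identifier out of a URL query string using Python's standard library. The URL format and the "cpn" parameter name are assumptions made for the example, not the actual encoding used by any database proprietor.

    from typing import Optional
    from urllib.parse import parse_qs, urlparse

    def extract_impression_id(url: str, param: str = "cpn") -> Optional[str]:
        """Return the impression identifier encoded as a query parameter, if present."""
        query = parse_qs(urlparse(url).query)
        values = query.get(param)
        return values[0] if values else None

    # Hypothetical URL carrying an impression identifier.
    example_url = "https://video.example/watch?v=abc123&cpn=IMPRESSION-0001"
    print(extract_impression_id(example_url))  # -> "IMPRESSION-0001"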

In some examples, the meter 115 may not be able to obtain the impression identifier (e.g., CPN) to associate with a particular media impression. For instance, in some examples where the panelist client device 112 is a mobile device, the meter 115 collects a mobile advertising identifier (MAID) and/or an identifier for advertisers (IDFA) that may be used to uniquely identify client devices 110 (e.g., the panelist client devices 112 being monitored by the AME 104). In some examples, the meter 115 reports the MAID and/or IDFA for the particular device associated with the meter 115 to the AME 104. The AME 104, in turn, provides the MAID and/or IDFA to the database proprietor 102 in a double blind exchange through which the database proprietor 102 provides the AME 104 with the impression identifiers (e.g., CPNs) associated with the client device 110 identified by the MAID and/or IDFA. Once the AME 104 receives the impression identifiers for the client device 110 (e.g., a particular panelist client device 112), the impression identifiers are associated with the impressions previously collected in connection with the device.

In the illustrated example, the database proprietor 102 logs each media impression occurring on any of the client devices 110 within the privacy-protected cloud environment 106. In some examples, logging an impression includes logging the time the impression occurred and the type of client device 110 (e.g., whether a desktop device, a mobile device, a tablet device, etc.) on which the impression occurred. Further, in some examples, impressions are logged along with the impression's unique impression identifier. In this example, the impressions and associated identifiers are logged in a campaign impressions database 116. The campaign impressions database 116 stores all impressions of the media 108 regardless of whether any particular impression was detected from a panelist client device 112 or a non-panelist client device 114. Furthermore, the campaign impressions database 116 stores all impressions of the media 108 regardless of whether the database proprietor 102 is able to match any particular impression to a particular subscriber of the database proprietor 102. As mentioned above, in some examples, the database proprietor 102 identifies a particular user (e.g., subscriber) associated with a particular media impression based on a cookie stored on the client device 110. In some examples, the database proprietor 102 associates a particular media impression with a user that was signed into the online services of the database proprietor 102 at the time the media impression occurred. In some examples, in addition to logging such impressions and associated identifiers in the campaign impressions database 116, the database proprietor 102 separately logs such impressions in a matchable impressions database 118. As used herein, a matchable impression is an impression that the database proprietor 102 is able to match to at least one of a particular subscriber (e.g., because the impression occurred on a client device 110 on which a user was signed into the database proprietor 102) or a particular client device 110 (e.g., based on a first-party cookie of the database proprietor 102 detected on the client device 110). In some examples, if the database proprietor 102 cannot match a particular media impression (e.g., because no user was signed in at the time the media impression occurred and there is no recognizable cookie on the associated client device 110), the impression is omitted from the matchable impressions database 118 but is still logged in the campaign impressions database 116.
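The logging rule described above (every impression goes to the campaign store; only impressions tied to a signed-in user or a recognized first-party cookie also go to the matchable store) can be sketched as follows. The record fields and store names are placeholders chosen for the example, not the schema of any actual database.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Impression:
        impression_id: str                       # unique identifier (e.g., a CPN)
        device_type: str                         # "desktop", "mobile", "tablet", ...
        timestamp: str
        signed_in_user: Optional[str] = None     # user account, if someone was signed in
        first_party_cookie: Optional[str] = None # device-level cookie, if recognized

    @dataclass
    class ImpressionStores:
        campaign_impressions: List[Impression] = field(default_factory=list)
        matchable_impressions: List[Impression] = field(default_factory=list)

        def log(self, imp: Impression) -> None:
            # Every impression is logged to the campaign impressions store.
            self.campaign_impressions.append(imp)
            # Only impressions matchable to a signed-in user or a recognized
            # first-party cookie are additionally logged as matchable.
            if imp.signed_in_user or imp.first_party_cookie:
                self.matchable_impressions.append(imp)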

As indicated above, the matchable impressions database 118 includes media impressions (and associated unique impression identifiers) that the database proprietor 102 is able to match to a particular user that has registered with the database proprietor 102. In some examples, the matchable impressions database 118 also includes user-based covariates that correspond to the particular user to which each impression in the database was matched. As used herein, a user-based covariate refers to any item(s) of information collected and/or generated by the database proprietor 102 that can be used to identify, characterize, quantify, and/or distinguish particular users and/or their associated behavior. For example, user-based covariates may include the name, age, and/or gender of the user (and/or any other demographic information about the user) collected at the time the user registered with the database proprietor 102, and/or the relative frequency with which the user uses the different types of client device 110, the number of media items the user has accessed during a most recent period of time (e.g., the last 30 days), the search terms entered by the user during a most recent period of time (e.g., the last 30 days), feature embeddings (numerical representations) of classifications of videos viewed and/or searches entered by the user, etc. As mentioned above, the matchable database 118 also includes impressions matched to particular client devices 110 (based on first-party cookies), even when the impressions cannot be matched to particular users (based on the users being signed in at the time). In some such examples, the impressions matched to particular client devices 110 are treated as distinct users within the matchable database 118. However, as no particular user can be identified, such impressions in the matchable database 118 will not be associated with any user-based covariates.

Although only one campaign impressions database 116 is shown in the illustrated example, the privacy-protected cloud environment 106 may include any number of campaign impressions databases 116, with each database storing impressions corresponding to different media campaigns associated with one or more different advertisers (e.g., product manufacturers, service providers, retailers, advertisement servers, etc.). In other examples, a single campaign impressions database 116 may store the impressions associated with multiple different campaigns. In some such examples, the campaign impressions database 116 may store a campaign identifier in connection with each impression to identify the particular campaign to which the impression is associated. Similarly, in some examples, the privacy-protected cloud environment 106 may include one or more matchable impressions databases 118 as appropriate. Further, in some examples, the campaign impressions database 116 and the matchable impressions database 118 may be combined and/or represented in a single database.

In the illustrated example of FIG. 1, impressions occurring on the client devices 110 are shown as being reported (e.g., via network communications) directly to both the campaign impressions database 116 and the matchable impressions database 118. However, this should not be interpreted as necessarily requiring multiple separate network communications from the client devices 110 to the database proprietor 102. Rather, in some examples, notifications of impressions are collected from a single network communication from the client device 110, and the database proprietor 102 then populates both the campaign impressions database 116 and the matchable impressions database 118. In some examples, the matchable impressions database 118 is generated based on an analysis of the data in the campaign impressions database 116. Regardless of the particular process by which the two databases 116, 118 are populated with logged impressions, in some examples, the user-based covariates included in the matchable impressions database 118 may be combined with the logged impressions in the campaign impressions database 116 and stored in an enriched impressions database 120. Thus, the enriched impressions database includes all (e.g., census wide) logged impressions of the media 108 for the relevant advertising campaign and also includes all available user-based covariates associated with each of the logged impressions that the database proprietor 102 was able to match to a particular user.

As shown in the illustrated example, whereas the database proprietor 102 is able to collect impressions from both panelist client devices 112 and non-panelist client devices 114, the AME 104 is limited to collecting impressions from panelist client devices 112. In some examples, the AME 104 also collects the impression identifier associated with each collected media impression so that the collected impressions may be matched with the impressions collected by the database proprietor 102 as described further below. In the illustrated example, the impressions (and associated impression identifiers) of the panelists are stored in an AME panel data database 122 that is within an AME first party data store 124 in an AME proprietary cloud environment 126. In some examples, the AME proprietary cloud environment 126 is a cloud-based storage system (e.g., a Google Cloud Project) provided by the database proprietor 102 that includes functionality to enable interfacing with the privacy-protected cloud environment 106 also maintained by the database proprietor 102. As mentioned above, the privacy-protected cloud environment 106 is governed by privacy constraints that prevent any party (with some limited exceptions for the database proprietor 102) from accessing private information associated with particular individuals. By contrast, the AME proprietary cloud environment 126 is indicated as proprietary because it is exclusively controlled by the AME such that the AME has full control and access to the data without limitation. While some examples involve the AME proprietary cloud environment 126 being a cloud-based system that is provided by the database proprietor 102, in other examples, the AME proprietary cloud environment 126 may be provided by a third party distinct from the database proprietor 102.

While the AME 104 is limited to collecting impressions (and associated identifiers) from only panelists (e.g., via the panelist client devices 112), the AME 104 is able to collect panel data that is much more robust than merely media impressions. As mentioned above, the panelist client devices 112 are associated with users that have agreed to participate on a panel of the AME 104. Participation in a panel includes the provision of detailed demographic information about the panelist and/or all members in the panelist's household. Such demographic information may include age, gender, race, ethnicity, education, employment status, income level, geographic location of residence, etc. In addition to such demographic information, which may be collected at the time a user enrolls as a panelist, the panelist may also agree to enable the AME 104 to track and/or monitor various aspects of the user's behavior. For example, the AME 104 may monitor panelists' Internet usage behavior including the frequency of Internet usage, the times of day of such usage, the websites visited, and the media exposed to (from which the media impressions are collected).

AME panel data (including media impressions and associated identifiers, demographic information, and Internet usage data) is shown in FIG. 1 as being provided directly to the AME panel data database 122 from the panelist client devices 112. However, in some examples, there may be one or more intervening operations and/or components that collect and/or process the collected data before it is stored in the AME panel data database 122. For instance, in some examples, impressions are initially collected and reported to a separate server and/or database that is distinct from the AME proprietary cloud environment 126. In some such examples, this separate server and/or database may not be a cloud-based system. Further, in some examples, such a non-cloud-based system may interface directly with the privacy-protected cloud environment 106 such that the AME proprietary cloud environment 126 may be omitted entirely.

In some examples, there may be multiple different techniques and/or methodologies used to collect the AME panel data depending on the particular circumstances involved. For example, different monitoring techniques and/or different types of audience measurement meters 115 may be employed for media accessed via a desktop computer relative to the media accessed via a mobile computing device. In some examples, the audience measurement meter 115 may be implemented as a software application that panelists agree to install on their devices to monitor all Internet usage activity on the respective devices. In some examples, the meter 115 may prompt a user of a particular device to identify themselves so that the AME 104 can confirm the identity of the user (e.g., whether it was the mother or daughter in a panelist household). In some examples, prompting a user to self-identify may be considered overly intrusive. Accordingly, in some such examples, the circumstances surrounding the behavior of the user of a panelist client device 112 (e.g., time of day, type of content being accessed, etc.) may be analyzed to infer the identity of the user to some confidence level (e.g., the accessing of children's content in the early afternoon would indicate a relatively high probability that a child is using the device at that point in time). In some examples, the audience measurement meter 115 may be a separate hardware device that is in communication with a particular panelist client device 112 and enabled to monitor the Internet usage of the panelist client device 112.

In some examples, the processes and/or techniques used by the AME 104 to capture panel data (including media impressions and who in particular was exposed to the media) can differ depending on the nature of the panelist client device 112 through which the media was accessed. For instance, in some examples, the identity of the individual using the client device 112 may be based on the individual responding to a prompt to self-identify. In some examples, such prompts are limited to desktop client devices because such a prompt is viewed as overly intrusive on a mobile device. However, without specifically prompting a user of a mobile device to self-identify, there often is no direct way to determine whether the user is the primary user of the device (e.g., the owner of the device) or someone else (e.g., a child of the primary user). Thus, there is the possibility of misattribution of media impressions within the panel data collected using mobile devices. In some examples, to overcome the issue of misattribution in the panel data, the AME 104 may develop a machine learning model that can predict the true user of a mobile device (or any device for that matter) based on information that the AME 104 does know for certain and/or has access to. For example, inputs to the machine learning model may include the composition of the panelist household, the type (e.g., genre and/or category) of the content, the daypart or time of day when the content was accessed, etc. In some examples, the truth data used to generate and validate such a model may be collected through field surveys in which the above input features are tracked and/or monitored for a subset of panelists that have agreed to be monitored in this manner (which is more intrusive than the typical passive monitoring of content accessed via mobile devices).
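For illustration, a minimal sketch of such a true-user prediction model is shown below, assuming scikit-learn and a small table of survey-derived truth labels. The feature columns, their encodings, and the choice of a decision tree are assumptions for the example rather than the model described herein.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical survey truth data: circumstances known to the AME plus the
    # surveyed identity of the actual device user within the panelist household.
    survey = pd.DataFrame({
        "household_size": [4, 4, 2, 2, 4, 4],
        "content_genre":  [0, 1, 0, 0, 1, 1],   # 0 = children's content, 1 = other
        "daypart":        [1, 3, 2, 3, 1, 2],   # coarse time-of-day bucket
        "actual_user":    ["child", "adult", "adult", "adult", "child", "adult"],
    })

    features = ["household_size", "content_genre", "daypart"]
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(survey[features], survey["actual_user"])

    # Predict the likely user for a new mobile impression with known circumstances.
    new_impression = pd.DataFrame([{"household_size": 4, "content_genre": 0, "daypart": 1}])
    print(model.predict(new_impression)[0])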

As mentioned above, in some examples, the AME panel data (stored in the AME panel data database 122) is merged with the database proprietor impressions data (stored in the matchable impressions database 118) within the privacy-protected cloud environment 106 to take advantage of the combination of the disparate sets of data to generate more robust and/or reliable audience measurement metrics. In particular, the database proprietor impressions data provides the advantage of volume. That is, the database proprietor impressions data corresponds to a much larger number of impressions than the AME panel data because the database proprietor impressions data includes census wide impression information that includes all impressions collected from both the panelist client devices 112 (associated with a relatively small pool of audience members) and the non-panelist client devices 114. The AME panel data provides the advantage of high-quality demographic data for a statistically significant pool of audience members (e.g., panelists) that may be used to correct for errors and/or biases in the database proprietor impressions data.

One source of error in the database proprietor impressions data is that the demographic information for matchable users collected by the database proprietor 102 during user registration may not be truthful. In particular, in some examples, many database proprietors impose age restrictions on their user accounts (e.g., a user must be at least 13 years of age, at least 18 years of age, etc.). However, when a person registers with the database proprietor 102, the user typically self-declares their age and may, therefore, lie about their age (e.g., an 11-year-old may say they are 18 years old to bypass the age restrictions for a user account). Independent of age restrictions, a particular user may choose to enter an incorrect age for any other reason or no reason at all (e.g., a 44-year-old may choose to assert they are only 25 years old). Where a database proprietor 102 does not verify the self-declared age of users, there is a relatively high likelihood that the ages of at least some registered users of the database proprietor stored in the matchable impressions database 118 (as a particular user-based covariate) are inaccurate. Further, it is possible that other self-declared demographic information (e.g., gender, race, ethnicity, income level, etc.) may also be falsified by users during registration. In some examples, demographic information for some registered users may be missing (e.g., registered users elect to not submit/declare certain demographic information). Mis-represented and/or missing demographic information from subscriber accounts of registered users of the database proprietor 102 results in inaccurate demographic-based audience measurements for accessed media. As described further below, the AME panel data (which contains reliable demographic information about the panelists) can be used to correct for inaccurate demographic information in the database proprietor impressions data.

Another source of error in the database proprietor impressions data is based on the concept of misattribution, which arises in situations where multiple different people use the same client device 110 to access media. In some examples, the database proprietor 102 associates a particular impression to a particular user based on the user being signed into a platform provided by the database proprietor. For example, if a particular person signs into their Google account and begins watching a YouTube video on a particular client device 110, that person will be attributed with an impression for an ad served during the video because the person was signed in at the time. However, there may be instances where the person finishes using the client device 110 but does not sign out of his or her Google account. Thereafter, a second different person (e.g., a different member in the family of the first person) begins using the client device 110 to view another YouTube video. Although the second person is now accessing media via the client device 110, ad impressions during this time will still be attributed to the first person because the first person is the one who is still indicated as being signed in. Thus, there are likely to be circumstances where the actual person exposed to media 108 is misattributed to a different registered user of the database proprietor 102. The AME panel data (which includes an indication of the actual person using the panelist client devices 112 at any given moment) can be used to correct for misattribution in the demographic information in the database proprietor impressions data. As mentioned above, in some situations, the AME panel data may itself include misattribution errors. Accordingly, in some examples, the AME panel data may first be corrected for misattribution before the AME panel data is used to correct misattribution in the database proprietor impressions data. An example methodology to correct for misattribution in the database proprietor impressions data is described in Singh et al., U.S. Pat. No. 10,469,903, which is hereby incorporated herein by reference in its entirety.

Additionally, examples disclosed herein use covariates associated with user data from a database proprietor (e.g., streaming browsing category data, search browsing category data, hours of the day the user is active, etc.) to generate the computer-trained machine learning models to correct for misattribution in the database proprietor impressions data. Examples disclosed herein generate a plurality of machine learning models with varying combinations of features and ranges for hyperparameters and run the different combinations of computer-generated machine learning models in parallel using the user data from the database proprietor. Examples disclosed herein determine performance results for the different computer-generated machine learning models and select one or more of the computer-generated machine learning models based on the performance results to use in correcting the computer-generated misattribution.

Another problem with the database proprietor impressions data is that of non-coverage. Non-coverage refers to impressions recorded by the database proprietor 102 that cannot be matched to a particular registered user of the database proprietor 102. The inability of the database proprietor 102 to match a particular impression to a particular user can occur for several reasons including that the user is not signed in at the time of the media impression, that the user has not established an account with the database proprietor 102, that the user has enabled Limited Ad Tracking (LAT) to prevent the user account from being associated with ad impressions, or that the content associated with the media being monitored corresponds to children's content (for which user-based tracking is not performed). While the inability of the database proprietor 102 to match and assign a particular impression to a particular user is not necessarily an error in the database proprietor impressions data, it does undermine the ability to reliably estimate the total unique audience size for (e.g., the number of unique individuals that were exposed to) a particular media item. For example, assume that the database proprietor 102 records a total of 11,000 impressions for media 108 in a particular advertising campaign. Further assume that of those 11,000 impressions, the database proprietor 102 is able to match 10,000 impressions to a total of 5,000 different users (e.g., each user was exposed to the media on average 2 times) but is unable to match the remaining 1,000 impressions to particular users. Relying solely on the database proprietor impressions data, in this example, there is no way to determine whether the remaining 1,000 impressions should also be attributed to the 5,000 users already exposed at least once to the media 108 (for a total audience size of 5,000 people) or if one or more of the remaining 1,000 impressions should be attributed to other users not among the 5,000 already identified (for a total audience size of up to 6,000 people (if every one of the 1,000 impressions was associated with a different person not included in the matched 5,000 users)). In some examples disclosed herein, the AME panel data can be used to estimate the distribution of impressions across different users associated with the non-coverage portion of impressions in the database proprietor impressions data to thereby estimate a total audience size for the relevant media 108. In some examples disclosed herein, a plurality of computer-generated machine learning models with different combinations of features and hyperparameters are run using the user data from the database proprietor. Examples disclosed herein determine performance results for the different computer-generated machine learning models based on a comparison of the results of the computer-generated machine learning models and the audience data associated with audience measurement panelists from the AME. Examples disclosed herein select one or more of the computer-generated machine learning models based on the performance results to use in correcting the computer-generated errors in audience data.
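The arithmetic in the non-coverage example above can be made explicit with a small sketch; the impression counts mirror the hypothetical figures in the preceding paragraph, and the function name is illustrative only.

    def audience_size_bounds(matched_impressions: int, matched_users: int,
                             unmatched_impressions: int) -> tuple:
        """Bounds on the unique audience size when some impressions cannot be
        matched to a user.

        Lower bound: every unmatched impression came from an already-matched user.
        Upper bound: every unmatched impression came from a distinct, unmatched person.
        """
        lower = matched_users
        upper = matched_users + unmatched_impressions
        return lower, upper

    # Figures from the example: 11,000 total impressions, 10,000 matched to 5,000 users.
    print(audience_size_bounds(10_000, 5_000, 1_000))  # -> (5000, 6000)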

Another confounding factor to the estimation of the total unique audience size for media based on the database proprietor impressions data is the existence of multiple user accounts of a single user. More particularly, in some situations a particular individual may establish multiple accounts with the database proprietor 102 for different purposes (e.g., a personal account, a work account, a joint account shared with other individuals, etc.). Such a situation can result in a larger number of different users being identified as audience members to media 108 than the actual number of individuals exposed to the media 108. For example, assume that a particular person registers three user accounts with the database proprietor 102 and is exposed to the media 108 once while signed into each of the three different accounts for a total of three impressions. In this scenario, the database proprietor 102 would match each impression to a different user based on the different user accounts, making it appear that three different people were exposed to the media 108 when, in fact, only one person was exposed to the media three different times. Examples disclosed herein use the AME panel data in conjunction with the database proprietor impressions data to estimate an actual unique audience size from the potentially inflated number of apparently unique users exposed to the media 108.

In the illustrated example of FIG. 1, the AME panel data is merged with the database proprietor impressions data by an example data matching analyzer 128. In some examples, the data matching analyzer 128 implements an application programming interface (API) that takes the disparate datasets and matches users in the database proprietor impressions data with panelists in the AME panel data. In some examples, users are matched with panelists based on the unique impression identifiers (e.g., CPNs) collected in connection with the media impressions logged by both the database proprietor 102 and the AME 104. The combined data is stored in an intermediary merged data database 130 within an AME privacy-protected data store 132. The data in the intermediary merged data database 130 is referred to as "intermediary" because it is at an intermediate stage in the processing: it includes AME panel data that has been enhanced and/or combined with the database proprietor impressions data but has not yet been corrected or adjusted to account for the sources of error and/or bias in the database proprietor impressions data as outlined above.
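The matching step described above is, in essence, a join on the shared impression identifier. The sketch below shows one way it could look using pandas; the table and column names are placeholders for illustration, not the layout of the actual databases.

    import pandas as pd

    # Hypothetical in-environment tables; column names are placeholders.
    db_proprietor_impressions = pd.DataFrame({
        "cpn": ["c1", "c2", "c3"],
        "db_user_id": ["u10", "u11", "u12"],
        "self_declared_age": [18, 25, 34],
    })
    ame_panel_impressions = pd.DataFrame({
        "cpn": ["c1", "c3"],
        "panelist_id": ["p1", "p2"],
        "panel_age": [11, 34],
    })

    # Match database proprietor users to AME panelists on the shared impression
    # identifier (CPN); unmatched database proprietor impressions are retained.
    intermediary_merged = db_proprietor_impressions.merge(
        ame_panel_impressions, on="cpn", how="left"
    )
    print(intermediary_merged)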

In some examples, the AME intermediary merged data is analyzed by an adjustment factor analyzer 134 to calculate adjustment or calibration factors that may be stored in an adjustment factors database 136 within an AME output data store 138 of the AME proprietary cloud environment 126. In some examples, the adjustment factor analyzer 134 calculates different types of adjustment factors to account for different types of errors and/or biases in the database proprietor impressions data. For instance, a multi-account adjustment factor corrects for the situation of a single user accessing media using multiple different user accounts associated with the database proprietor 102. A signed-out adjustment factor corrects for non-coverage associated with users that access media while signed out of their account associated with the database proprietor 102 (so that the database proprietor 102 is unable to associate the impression with the users). In some examples, the adjustment factor analyzer 134 is able to directly calculate the multi-account adjustment factor and the signed-out adjustment factor in a deterministic manner.
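As a rough illustration only (the text does not spell out the specific formulas used by the adjustment factor analyzer 134), adjustment factors of this kind can be expressed as simple ratios estimated from the matched panel data; the counts below are hypothetical.

    def multi_account_adjustment_factor(distinct_accounts_seen: int,
                                        distinct_panelists: int) -> float:
        """Ratio scaling apparently-unique accounts down to unique people
        (e.g., 3 accounts observed for 1 panelist gives a factor of 1/3)."""
        return distinct_panelists / distinct_accounts_seen

    def signed_out_adjustment_factor(signed_in_impressions: int,
                                     total_impressions: int) -> float:
        """Ratio scaling signed-in (matchable) impressions up to all impressions,
        accounting for impressions logged while users were signed out."""
        return total_impressions / signed_in_impressions

    # Hypothetical panel-derived counts.
    print(multi_account_adjustment_factor(distinct_accounts_seen=6000, distinct_panelists=5000))
    print(signed_out_adjustment_factor(signed_in_impressions=10_000, total_impressions=11_000))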

While the multi-account adjustment factors and the signed-out adjustment factors may be deterministically calculated, correcting for falsified or otherwise incorrect demographic information (e.g., incorrectly self-declared ages) of registered users of the database proprietor 102 cannot be accomplished in such a direct and deterministic manner. Rather, in some examples, computer-generated machine learning models are developed to analyze and predict the correct demographics (e.g., ages) of registered users of the database proprietor 102. Specifically, as shown in FIG. 1, the privacy-protected cloud environment 106 implements a model generator 140 to generate computer-generated machine learning models using the AME intermediary merged data (stored in the AME intermediary merged data database 130) as inputs. More particularly, in some examples, self-declared demographics (e.g., the self-declared age) of users of the database proprietor 102, along with other covariates associated with the users, are used as the input variables or features used to train the computer-generated machine learning models to predict the correct demographics (e.g., correct age) of the users as validated by the AME panel data, which serves as the truth data or training labels for the model generation. The example model generator 140 determines performance results for the different computer-generated machine learning models (e.g., model accuracy, demographic data accuracy, etc.) based on a comparison of the results of the computer-generated machine learning models and the audience data associated with audience measurement panelists from the AME. After the different computer-generated machine learning models have been trained and validated based on the AME panel data, the computer-generated machine learning models and associated performance results are stored in an example demographic correction models database 142. In some examples, different demographic correction model(s) may be developed to correct for different types of demographic information that needs correcting. For instance, in some examples, a first model can be used to correct the self-declared age of users of the database proprietor 102 and a second model can be used to correct the self-declared gender of the users.

As mentioned above, there are many different types of covariates collected and/or generated by the database proprietor 102. In some examples, the covariates provided by the database proprietor 102 may include a certain number (e.g., 100) of the top search result click entities and/or video watch entities for every user during a most recent period of time (e.g., for the last month). These entities are integer identifiers (IDs) that map to a knowledge graph of all entities for the search result clicks and/or videos watched. That is, as used in this context, an entity corresponds to a particular node in a knowledge graph maintained by the database proprietor 102. In some examples, the total number of unique IDs in the knowledge graph may number in the tens of millions. More particularly, for example, YouTube videos are classified across roughly 20 million unique video entity IDs and Google search results are classified across roughly 25 million unique search result entity IDs. In addition to the top search result click entities and/or video watch entities, the database proprietor 102 may also provide embeddings for these entities. An embedding is a numerical representation (e.g., a vector array of values) of some class of similar objects, images, words, and the like. For example, a particular user that frequently searches for and/or views cat videos may be associated with a feature embedding representative of the class corresponding to cats. Thus, feature embeddings translate relatively high dimensional vectors of information (e.g., text strings, images, videos, etc.) into a lower dimensional space to enable the classification of different but similar objects.

In some examples, multiple embeddings may be associated with each search result click entity and/or video watch entity. Accordingly, assuming the top 100 search result entities and video watch entities are provided among the covariates and that 16-dimension embeddings are provided for each such entity, this results in a 100×16 matrix of values for every user, which may be too much data to process during generation of the demographic correction models as described above. Accordingly, in some examples, the dimensionality of the matrix is reduced to a more manageable size to be used as an input feature for the demographic correction model generation.
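The reduction technique is not specified above; the following sketch shows one plausible approach (mean pooling of each user's 100×16 entity-embedding matrix, optionally followed by principal component analysis), purely as an assumed example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical per-user matrices: 1,000 users x top-100 entities x 16-dim embeddings.
rng = np.random.default_rng(0)
user_matrices = rng.normal(size=(1000, 100, 16))

# Mean pooling collapses each user's 100x16 matrix into a single 16-dim vector.
pooled = user_matrices.mean(axis=1)            # shape: (1000, 16)

# Optionally reduce further so the result is a compact model input feature.
pca = PCA(n_components=8)
user_features = pca.fit_transform(pooled)      # shape: (1000, 8)
print(user_features.shape)
```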

In some examples, a process is implemented to track different computer-generated machine learning model experiments over time to achieve high quality (e.g., accurate) models and also for auditing purposes. Accomplishing this objective within the context of the privacy-protected cloud environment 106 presents several unique challenges because the model features (e.g., inputs and hyperparameters) and model performance (e.g., accuracy) are stored separately to satisfy the privacy constraints of the environment.

In some examples, a model analyzer 144 may implement and/or use one or more computer-generated machine learning models to generate predictions and/or inferences as to the actual demographics (e.g., actual ages) of users associated with media impressions logged by the database proprietor 102. In examples disclosed herein, the model analyzer 144 obtains the computer-generated machine learning models and associated performance results stored in the demographic correction models database 142. The example model analyzer 144 aggregates the performance results for the computer-generated machine learning models and selects one or more of the computer-generated machine learning models based on the performance results to use in correcting human errors (e.g., errors from users self-declaring/entering inaccurate demographic information such as age, gender, etc.) and/or computer-generated errors (e.g., misattribution error(s), non-coverage error(s), etc.) in the demographic information for users associated with the impressions from the database proprietor 102. In some examples, as shown in FIG. 1, the model analyzer 144 uses the selected computer-generated machine learning model(s) from the demographic correction models database 142 to analyze the impressions in the enriched impressions database 120 that were matched to a particular user of the database proprietor 102. The inferred demographic (e.g., age) for each user may be stored in a model inferences database 146 for subsequent use, retrieval, and/or analysis. Additionally or alternatively, in some examples, the model analyzer 144 uses the selected computer-generated machine learning model(s) from the demographic correction models database 142 to analyze the entire user base of the database proprietor 102 regardless of whether the users are matched to any particular media impressions. After inferring the correct demographic (e.g., age) for each user, the inferences are stored in the model inferences database 146. In some such examples, when the users matched to particular impressions are to be analyzed (e.g., the users matched to impressions in the enriched impressions database 120), the model analyzer 144 merely extracts the inferred demographic assignment for each relevant user in the enriched impressions database 120 that matches with one or more media impressions.

As described above, in some examples, the database proprietor 102 may identify a particular user as corresponding to a particular impression based on the user being signed into the database proprietor 102. However, there are circumstances where the individual corresponding to the user account is not the actual person that was exposed to the relevant media. Accordingly, merely inferring a correct demographic (e.g., age) of the user associated with the signed-in user account may not yield the correct demographic of the actual person to which a particular media impression should be attributed. In other words, whereas the AME panelist data and the database proprietor impressions data are matched at the impression level, demographic correction is implemented at the user level. Therefore, before generating the demographic correction model, a method to reduce logged impressions to individual users is first implemented so that the demographic correction model can be reliably implemented.

With inferences made to correct for inaccurate demographic information of database proprietor users (e.g., falsified self-declared ages) and stored in the model inferences database 146, the AME 104 may be interested in extracting audience measurement metrics based on the corrected data. However, as mentioned above, the data contained inside the privacy-protected cloud environment 106 is subject to privacy constraints. In some examples, the privacy constraints ensure that the data can only be extracted for review and/or analysis in aggregate so as to protect the privacy of any particular individual represented in the data (e.g., a panelist of the AME 104 and/or a registered user of the database proprietor 102). Accordingly, in some examples, a data aggregator 148 aggregates the audience measurement data associated with particular media campaigns before the data is provided to an aggregated campaign data database 150 in the AME output data store 138 of the AME proprietary cloud environment 126.

The data aggregator 148 may aggregate data in different ways for different types of audience measurement metrics. For instance, at the highest level, the aggregated data may provide the total impression count and total number of users (e.g., estimated audience size) exposed to the media 108 for a particular media campaign. As mentioned above, the total number of users reported by the data aggregator 148 is based on the total number of unique user accounts matched to impressions but does not include the individuals associated with impressions that were not matched to a particular user (e.g., non-coverage). However, the total number of unique user accounts does not account for the fact that a single individual may correspond to more than one user account (e.g., multi-account users), and does not account for situations where a person other than a signed-in user was exposed to the media 108 (e.g., misattribution). These errors in the aggregated data may be corrected based on the adjustment factors stored in the adjustment factors database 136. Further, in some examples, the aggregated data may include an indication of the demographic composition of the users represented in the aggregated data (e.g., number of males versus females, number of users in different age brackets, etc.).

Additionally or alternatively, in some examples, the data aggregator 148 may provide aggregated data that is associated with a particular aspect of a media campaign. For instance, the data may be aggregated based on particular sites (e.g., all media impressions served on YouTube.com). In other examples, the data may be aggregated based on placement information (e.g., aggregated based on particular primary content videos accessed by users when the media advertisement was served). In other examples, the data may be aggregated based on device type (e.g., impressions served via a desktop computer versus impressions served via a mobile device). In other examples, the data may be aggregated based on a combination of one or more of the above factors and/or based on any other relevant factor(s).

In some examples, the privacy constraints imposed on the data within the privacy-protected cloud environment 106 include a limitation that data cannot be extracted (even when aggregated) for less than a threshold number of individuals (e.g., 50 individuals). Accordingly, if the particular metric being sought includes less than the threshold number of individuals, the data aggregator 148 will not provide such data. For instance, if the threshold number of individuals is 50 but there are only 46 females in the age range of 18-25 that were exposed to particular media 108, the data aggregator 148 would not provide the aggregate data for females in the 18-25 age bracket. Such privacy constraints can leave gaps in the audience measurement metrics, particularly in locations where the number of panelists is relatively small. Accordingly, in some examples, when audience measurement data is not available for a particular demographic segment of interest in a particular region (e.g., a particular country), the audience measurement metrics in one or more comparable region(s) may be used to impute the metrics for the missing data in the first region of interest. In some examples, the particular metrics imputed from comparable regions are based on a comparison of audience metrics for which data is available in both regions. For instance, while data for females in the 18-25 age bracket may be unavailable, assume that data for females in the 26-35 age bracket is available. The metrics associated with the 26-35 age bracket in the region of interest may be compared with metrics for the 26-35 age bracket in other regions, and the regions with the closest metrics to the region of interest may be selected for use in calculating imputation factor(s).
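A minimal sketch of the imputation idea follows, using made-up numbers: when a demographic cell is suppressed in the region of interest, a comparable region is chosen by the closeness of an overlapping metric, and the ratio between its cells is used as the imputation factor. The region names and counts are illustrative assumptions.

```python
# Audience counts by demographic cell; None marks a cell suppressed by the
# privacy threshold (fewer than 50 individuals). All numbers are made up.
region_of_interest = {"F26-35": 1200, "F18-25": None}
other_regions = {
    "region_a": {"F26-35": 1150, "F18-25": 900},
    "region_b": {"F26-35": 2400, "F18-25": 2100},
}

# Pick the comparable region whose overlapping (available) metric is closest.
comparable = min(other_regions,
                 key=lambda r: abs(other_regions[r]["F26-35"] - region_of_interest["F26-35"]))

# Imputation factor: ratio of the missing cell to the shared cell in that region.
factor = other_regions[comparable]["F18-25"] / other_regions[comparable]["F26-35"]
imputed_f18_25 = region_of_interest["F26-35"] * factor
print(comparable, round(imputed_f18_25))       # e.g., region_a, ~939
```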

As shown in the illustrated example, both the adjustment factors database 136 and the aggregated campaign data database 150 are included within the AME output data store 138 of the AME proprietary cloud environment 126. As mentioned above, in some examples, the AME proprietary cloud environment 126 is provided by the database proprietor 102 and enables data to be provided to and retrieved from the privacy-protected cloud environment 106. In some examples, the aggregated campaign data and the adjustment factors are subsequently transferred to a separate computing apparatus 152 of the AME 104 for analysis by an audience metrics analyzer 154. In some examples, the separate computing apparatus 152 may be omitted, with its functionality provided by the AME proprietary cloud environment 126. In other examples, the AME proprietary cloud environment 126 may be omitted, with the adjustment factors and the aggregated data provided directly to the computing apparatus 152. Further, in this example, the AME panel data database 122 is within the AME first party data store 124, which is shown as being separate from the AME output data store 138. However, in other examples, the AME first party data store 124 and the AME output data store 138 may be combined.

In the illustrated example of FIG. 1, the audience metrics analyzer 154 applies the adjustment factors to the aggregated data to correct for errors in the data including misattribution, non-coverage, and multi-account users. The output of the audience metrics analyzer 154 corresponds to the final calibrated data of the AME 104 and is stored in a final calibrated data database 156. In this example, the computing apparatus 152 also includes a report generator 158 to generate reports based on the final calibrated data.
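For illustration, a simplified sketch of applying adjustment factors to the aggregated data is shown below. The factor names and values, and the multiplicative form of the correction, are assumptions for the sketch rather than the AME's actual calibration formulas.

```python
# Aggregated campaign metrics and adjustment factors (illustrative values only).
aggregated = {"impressions": 1_000_000, "unique_accounts": 250_000}
adjustment_factors = {"multi_account": 0.93, "misattribution": 0.97, "signed_out": 1.08}

# Apply each factor multiplicatively to the unique-account audience estimate.
calibrated_audience = aggregated["unique_accounts"]
for factor in adjustment_factors.values():
    calibrated_audience *= factor

print(round(calibrated_audience))              # final calibrated audience estimate
```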

FIG. 2 is a block diagram of the example model generator 140 of FIG. 1. The example model generator 140 of FIG. 2 includes an example feature interface 202, an example hyperparameter interface 204, an example query selector 206, an example query generator 208, and an example analytics controller 210.

In the illustrated example of FIG. 2, the example feature interface 202 accesses candidate features (e.g., from memory, a receive buffer, etc.) to use as inputs for generating the computer-generated machine learning models. In some examples, the feature interface 202 accesses covariates associated with user data from the database proprietor 102. In examples disclosed herein, covariates are used as candidate features for generating the computer-generated machine learning models. In some examples, the covariates represent information collected and/or generated by the database proprietor 102 that is used to identify, characterize, quantify, and/or distinguish particular users and/or their associated behavior. For example, the covariates may include the name, age, and/or gender of the user (and/or any other demographic information about the user) collected at the time the user registered with the database proprietor 102, the relative frequency with which the user uses the different types of client device 110, the number of media items the user has accessed during a most recent period of time (e.g., the last 30 days), the search terms entered by the user during a most recent period of time (e.g., the last 30 days), feature embeddings (numerical representations) of classifications of videos viewed and/or searches entered by the user, etc. In some examples, the feature interface 202 includes a range of candidate features (e.g., 20 to 100 candidate features) for the computer-generated machine learning models based on the covariates.

In the illustrated example, the example hyperparameter interface 204 accesses candidate hyperparameters (e.g., from memory, a receive buffer, etc.) to use as inputs for generating the computer-generated machine learning models. In some examples, the hyperparameter interface 204 accesses the different candidate hyperparameters available to the computer-generated machine learning models from the database proprietor 102. In some examples, the candidate hyperparameters available from the database proprietor 102 can include the topology of a neural network, the size of the neural network, the learning rate of the neural network, the batch size of the neural network, etc. In some examples, the hyperparameter interface 204 accesses eight different hyperparameters with adjustable values from the database proprietor 102 to use as inputs to the computer-generated machine learning models.

In the illustrated example of FIG. 2, the example query selector 206 selects a plurality of features from the example feature interface 202 and selects a range of hyperparameters from the example hyperparameter interface 204. The example query selector 206 selects features from the feature interface 202 on which to run different combinations of machine learning models. In some examples, the query selector 206 selects eight different features from the candidate features via the feature interface 202 for each of the computer-generated machine learning models. For example, the query selector 206 selects a first set of eight features from the candidate features for a first computer-generated machine learning model and selects a second set of eight features from the candidate features for a second computer-generated machine learning model, where the first set of eight features and the second set of eight features may differ by any number of features (e.g., one different feature, eight different features, etc.). However, the example query selector 206 can select any number of different features for the computer-generated machine learning models. The example query selector 206 selects ranges for the hyperparameters obtained via the hyperparameter interface 204 to provide to the computer-generated machine learning models. The example query selector 206 selects different combinations of the features via the feature interface 202 and ranges of the hyperparameters via the hyperparameter interface 204 for each of the different machine learning models. In some examples, the query selector 206 selects a plurality of combinations of features and hyperparameters to be used to correct computer-generated data errors as disclosed herein. In some examples, the plurality of combinations of features and hyperparameters includes every possible combination of features and hyperparameters available via the example feature interface 202 and the example hyperparameter interface 204. In such examples, the different combinations of candidate features and hyperparameters are evaluated using the computer-generated machine learning models so that all of the generated results of those models can be evaluated relative to one another. In this manner, one or more of the combinations of candidate features and hyperparameters can be selected based on such performance evaluations (e.g., select one or more of the combinations of candidate features and hyperparameters that achieve relatively better performance than other ones of the combinations of candidate features and hyperparameters) for use in correcting any human-generated errors (e.g., errors from users self-declaring inaccurate demographic information such as age, gender, etc.) and/or any computer-generated errors in the user demographic information (e.g., misattribution, non-coverage, etc.).
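A minimal sketch of this exhaustive enumeration follows, assuming illustrative candidate features and hyperparameter ranges (the specific names, set sizes, and values are not taken from this disclosure).

```python
from itertools import combinations, product

# Illustrative candidate features and hyperparameter ranges (assumed values).
candidate_features = ["self_declared_age", "device_mix", "search_embedding",
                      "video_embedding", "media_items_30d"]
hyperparameter_ranges = {
    "learning_rate": [0.01, 0.1],
    "batch_size": [64, 256],
    "hidden_layers": [1, 2],
}

# Every feature subset of a chosen size, crossed with every hyperparameter setting.
feature_sets = list(combinations(candidate_features, 3))
hyperparameter_grid = [dict(zip(hyperparameter_ranges, values))
                       for values in product(*hyperparameter_ranges.values())]

model_specs = [{"features": fs, "hyperparameters": hp}
               for fs, hp in product(feature_sets, hyperparameter_grid)]
print(len(model_specs), "candidate model configurations")   # 10 x 8 = 80 here
```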

The example query generator 208 generates a plurality of different computer-generated machine learning models based on the plurality of selected combinations of the sets of features and ranges of hyperparameters from the example query selector 206. The example query generator 208 initiates the training of the plurality of computer-generated machine learning models based on demographic data of audience measurement panelists that are also subscribers of the database proprietor 102. In the illustrated example, the demographic data is obtained from user accounts of the database proprietor 102 in the privacy-protected cloud environment 106. In some examples, the query generator 208 uses the demographic data of the users from the database proprietor 102, the selected features, and the selected ranges of hyperparameters to train the plurality of computer-generated machine learning models. In some examples, the query generator 208 triggers the training of the computer-generated machine learning models in parallel. In some examples, manually running and training only one of the computer-generated machine learning models could take up to ten minutes. However, in the illustrated example, the example query generator 208 triggers the running and training of all of the computer-generated machine learning models in parallel, which allows all of the computer-generated machine learning models to be run and trained in about ten minutes. The example query generator 208 generates the demographic results of each of the computer-generated machine learning models based on the training and running of the computer-generated machine learning models.
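The following sketch illustrates triggering training jobs in parallel. The train_model function is a hypothetical stand-in for whatever training job runs inside the privacy-protected cloud environment 106, and the thread-pool approach is one assumed way to fan out the work.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the enumerated feature/hyperparameter configurations.
model_specs = [{"features": ("f1", "f2"), "hyperparameters": {"learning_rate": lr}}
               for lr in (0.01, 0.1)]

def train_model(spec):
    # Placeholder training job; in practice this would be submitted to the
    # privacy-protected cloud environment against the merged panel data.
    return {"spec": spec, "model": f"trained({spec['hyperparameters']})"}

# Fan the training jobs out in parallel instead of running them one at a time.
with ThreadPoolExecutor(max_workers=16) as pool:
    trained_models = list(pool.map(train_model, model_specs))
print(len(trained_models), "models trained in parallel")
```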

In the illustrated example of FIG. 2, the example analytics controller 210 generates performance results for all of the computer-generated machine learning models from the example query generator 208. The example analytics controller 210 compares the results (e.g., demographic information for users) from the training of the plurality of computer-generated machine learning models to the demographic data from audience measurement panelists from the AME panel data. In some examples, the demographic data from audience measurement panelists from the AME panel data is used to validate the demographic results of the computer-generated machine learning models (e.g., the AME panel data serves as the truth data). In some examples, the analytics controller 210 obtains demographic data of the audience measurement panelists who access media via panelist client devices and who correspond to the subscribers of the database proprietor 102 to determine the performance of each of the computer-generated machine learning models. In some examples, the performance results include model accuracy, demographic accuracy, etc. In some examples, the example analytics controller 210 stores the computer-generated machine learning models and the corresponding performance results in the example demographic correction models database 142 of FIG. 1.
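As an assumed illustration of scoring each trained model against the panel truth data, the following sketch computes a simple accuracy metric per model; the actual performance metrics and model interfaces used by the analytics controller 210 may differ.

```python
from sklearn.metrics import accuracy_score

def score_models(trained_models, X_panel, y_panel):
    """Score each trained model against panel-validated demographics (truth data)."""
    results = []
    for entry in trained_models:
        predictions = entry["model"].predict(X_panel)   # assumes sklearn-style models
        results.append({
            "spec": entry["spec"],
            "model_accuracy": accuracy_score(y_panel, predictions),
        })
    return results
```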

FIG. 3 is a block diagram illustrating the example model analyzer 144 of FIG. 1. The example model analyzer 144 of FIG. 3 includes an example query results interface 302, an example data aggregation controller 304, and an example model selector 306.

The example query results interface 302 obtains the computer-generated machine learning models and corresponding performance results from the example analytics controller 210 of FIG. 2. In some examples, the example query results interface 302 obtains the computer-generated machine learning models and corresponding performance results stored in the example demographic correction models database 142 of FIG. 1.

In the illustrated example of FIG. 3, the example data aggregation controller 304 runs a query to merge the individual performance results for the computer-generated machine learning models to aggregate the performance results. The example data aggregation controller 304 generates aggregate results of the performance results from the plurality of computer-generated machine learning models. In some examples, the data aggregation controller 304 aggregates the performance results into a tabular format that is easier to read and interpret when analyzing the performance results of the computer-generated machine learning models.
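For illustration, the following sketch merges per-model performance records into a single table using pandas; the record fields and values are hypothetical.

```python
import pandas as pd

# Hypothetical per-model performance records produced by the analytics controller.
performance_records = [
    {"model_id": "m1", "model_accuracy": 0.81, "demographic_accuracy": 0.78},
    {"model_id": "m2", "model_accuracy": 0.84, "demographic_accuracy": 0.80},
    {"model_id": "m3", "model_accuracy": 0.79, "demographic_accuracy": 0.83},
]

# Merge into a single table sorted so the strongest models are easy to read off.
results_table = (pd.DataFrame(performance_records)
                   .sort_values("model_accuracy", ascending=False))
print(results_table.to_string(index=False))
```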

In the illustrated example of FIG. 3, the example model selector 306 compares the performance results of the computer-generated machine learning models based on the aggregated performance results from the example data aggregation controller 304. The example model selector 306 selects one of the computer-generated machine learning models based on the performance results. In some examples, the example model selector 306 compares the performance results across the computer-generated machine learning models to select the computer-generated machine learning model with the relatively best performance. In some examples, the example model selector 306 determines that a computer-generated machine learning model has the relatively best performance when the aggregated performance results for that computer-generated machine learning model are higher in value (e.g., a higher value for model accuracy, a higher value for demographic accuracy, etc.) compared to the aggregated performance results for the remaining computer-generated machine learning models. In some examples, the example model selector 306 determines that a computer-generated machine learning model has the relatively best performance based on a combination of all of the aggregated performance results for the computer-generated machine learning model. In some examples, the different performance results for the computer-generated machine learning models can be weighted together to determine the computer-generated machine learning model with the relatively best performance. In some examples, the example model selector 306 determines the computer-generated machine learning model with the relatively best performance based on a combination of the performance metrics determined by the example analytics controller 210 of FIG. 2 (e.g., model accuracy, demographic accuracy, etc.). The example model selector 306 determines which combination of features and hyperparameters yielded the relatively best performance results among the computer-generated machine learning models. The example model selector 306 enables the comparison of the performance results from all of the computer-generated machine learning models in a fraction of the time relative to comparing the performance results manually.
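A hedged sketch of selecting the relatively best model by a weighted combination of aggregated metrics is shown below; the weights and metric names are illustrative assumptions, not values specified in this disclosure.

```python
# Illustrative weights for combining the aggregated metrics into one score.
weights = {"model_accuracy": 0.6, "demographic_accuracy": 0.4}

aggregated_results = [
    {"model_id": "m1", "model_accuracy": 0.81, "demographic_accuracy": 0.78},
    {"model_id": "m2", "model_accuracy": 0.84, "demographic_accuracy": 0.80},
    {"model_id": "m3", "model_accuracy": 0.79, "demographic_accuracy": 0.83},
]

# Select the model whose weighted combination of metrics is highest.
best = max(aggregated_results,
           key=lambda row: sum(weights[metric] * row[metric] for metric in weights))
print("selected model:", best["model_id"])
```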

In some examples, the example model selector 306 applies the selected computer-generated machine learning model to the impressions from the database proprietor 102 to correct for any human errors (e.g., errors from users self-declaring inaccurate demographic information such as age, gender, etc.) and/or any computer-generated errors in the user demographic information (e.g., misattribution, non-coverage, etc.). In some examples, the example model selector 306 uses the selected computer-generated machine learning model to analyze the impressions in the example enriched impressions database 120 of FIG. 1 that were matched to a particular user of the database proprietor 102. The demographic information (e.g., age, gender, etc.) resulting from applying the selected computer-generated machine learning model for each user may be stored in the example model inferences database 146 of FIG. 1 for subsequent use, retrieval, and/or analysis. In some examples, the example model selector 306 uses the selected computer-generated machine learning model to analyze the entire user base of the database proprietor 102 regardless of whether the users are matched to any particular media impressions. After inferring the correct demographic information (e.g., age, gender, etc.) for each user using the selected computer-generated machine learning model, the example model selector 306 stores the inferences in the example model inferences database 146 of FIG. 1. In some such examples, when the users matched to particular impressions are to be analyzed (e.g., the users matched to impressions in the enriched impressions database 120), the example model selector 306 extracts the inferred demographic assignment for each relevant user in the enriched impressions database 120 that matches with one or more media impressions. In some examples, the model selector 306 can infer other information in addition to or instead of the demographic information for each user using the selected computer-generated machine learning model to correct for any other human errors and/or any computer-generated errors in the media impressions.

While example manners of implementing the model generator 140 and the model analyzer 144 of FIG. 1 are illustrated in FIGS. 2 and 3, one or more of the elements, processes and/or devices illustrated in FIGS. 2 and 3 may be combined, divided, re-arranged, omitted, eliminated and/or implemented in any other way. Further, the example feature interface 202, the example hyperparameter interface 204, the example query selector 206, the example query generator 208, the example analytics controller 210, the example query results interface 302, the example data aggregation controller 304, the example model selector 306 and/or, more generally, the example model generator 140 and the example model analyzer 144 of FIGS. 2 and 3 may be implemented by hardware, software, firmware and/or any combination of hardware, software and/or firmware. Thus, for example, any of the example feature interface 202, the example hyperparameter interface 204, the example query selector 206, the example query generator 208, the example analytics controller 210, the example query results interface 302, the example data aggregation controller 304, the example model selector 306 and/or, more generally, the example model generator 140 and the example model analyzer 144 could be implemented by one or more analog or digital circuit(s), logic circuits, programmable processor(s), programmable controller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)) and/or field programmable logic device(s) (FPLD(s)). When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of the example feature interface 202, the example hyperparameter interface 204, the example query selector 206, the example query generator 208, the example analytics controller 210, the example query results interface 302, the example data aggregation controller 304, and/or the example model selector 306 is/are hereby expressly defined to include a non-transitory computer readable storage device or storage disk such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc. including the software and/or firmware. Further still, the example model generator 140 and the example model analyzer 144 of FIGS. 2 and 3 may include one or more elements, processes and/or devices in addition to, or instead of, those illustrated in FIGS. 2 and 3, and/or may include more than one of any or all of the illustrated elements, processes and devices. As used herein, the phrase "in communication," including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, and does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events.

Flowcharts representative of example hardware logic, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the model generator 140 and the model analyzer 144 of FIGS. 2 and 3 are shown in FIGS. 4 and 5. The machine readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a computer processor and/or processor circuitry, such as the processor 612 shown in the example processor platform 600 discussed below in connection with FIG. 6. The program may be embodied in software stored on a non-transitory computer readable storage medium such as a CD-ROM, a floppy disk, a hard drive, a DVD, a Blu-ray disk, or a memory associated with the processor 612, but the entire program and/or parts thereof could alternatively be executed by a device other than the processor 612 and/or embodied in firmware or dedicated hardware. Further, although the example program is described with reference to the flowcharts illustrated in FIGS. 4 and 5, many other methods of implementing the example model generator 140 and the example model analyzer 144 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., discrete and/or integrated analog and/or digital circuitry, an FPGA, an ASIC, a comparator, an operational-amplifier (op-amp), a logic circuit, etc.) structured to perform the corresponding operation without executing software or firmware. The processor circuitry may be distributed in different network locations and/or local to one or more devices (e.g., a multi-core processor in a single machine, multiple processors distributed across a server rack, etc.).

The machine readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine readable instructions as described herein may be stored as data or a data structure (e.g., portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers) located at the same or different locations of a network or collection of networks (e.g., in the cloud, in edge devices, etc.). The machine readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc. in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and stored on separate computing devices, wherein the parts when decrypted, decompressed, and combined form a set of executable instructions that implement one or more functions that may together form a program such as that described herein.

In another example, the machine readable instructions may be stored in a state in which they may be read by processor circuitry, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc. in order to execute the instructions on a particular computing device or other device. In another example, the machine readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, machine readable media, as used herein, may include machine readable instructions and/or program(s) regardless of the particular format or state of the machine readable instructions and/or program(s) when stored or otherwise at rest or in transit.

The machine readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.

As mentioned above, the example processes of FIGS. 4 and 5 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on a non-transitory computer and/or machine readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.

“Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc. may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the terms “comprising” and “including” are open ended. The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, and (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B.

As used herein, singular references (e.g., “a”, “an”, “first”, “second”, etc.) do not exclude a plurality. The term “a” or “an” item, as used herein, refers to one or more of that item. The terms “a” (or “an”), “one or more”, and “at least one” can be used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements or method actions may be implemented by, e.g., a single unit or processor. Additionally, although individual features may be included in different examples or claims, these may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.

FIG. 4 is a flowchart representative of machine readable instructions 400 which may be executed to implement the example model generator 140 of FIGS. 1 and/or 2. The example instructions 400 begin at block 402 at which the example query selector 206 (FIG. 2) selects features on which to run different combinations of models. In some examples, the example query selector 206 selects a plurality of features via the example feature interface 202 of FIG. 2. In some examples, the query selector 206 selects eight different features from the candidate features via the feature interface 202 for each of the computer-generated machine learning models. For example, the query selector 206 selects a first set of eight features from the candidate features for a first computer-generated machine learning model and selects a second set of eight features from the candidate features for a second computer-generated machine learning model, where one or more of the first set of eight features may differ in type from one or more of the second set of eight features (e.g., one different feature type, eight different feature types, etc.). However, the example query selector 206 can select any number of different features for the computer-generated machine learning models.

At block 404, the example query selector 206 selects a range for the hyperparameters to provide to the models. In some examples, the example query selector 206 selects a range of hyperparameters via the example hyperparameter interface 204. The example query selector 206 selects ranges for the hyperparameters via the hyperparameter interface 204 to provide to the computer-generated machine learning models. The example query selector 206 selects different combinations of the features from the feature interface 202 and ranges of the hyperparameters from the hyperparameter interface 204 for each of the different machine learning models. In some examples, the query selector 206 selects every possible combination of features and hyperparameters available from the example feature interface 202 and the example hyperparameter interface 204.

At block 406, the example query generator 208 (FIG. 2) generates different models from combinations of the set of features and range of hyperparameters. In some examples, the example query generator 208 generates a plurality of different computer-generated machine learning models based on the plurality of selected combinations of the features and ranges of hyperparameters from the example query selector 206. At block 408, the example query generator 208 triggers parallel training of the models. In some examples, the example query generator 208 initiates the training of the plurality of computer-generated machine learning models based on demographic data of audience measurement panelists that are also subscribers of the database proprietor 102. In the illustrated example, the demographic data is obtained from user accounts of the database proprietor 102 in the privacy-protected cloud environment 106. In some examples, the query generator 208 uses the demographic data of the users from the database proprietor 102, the selected features, and the selected ranges of hyperparameters to train the plurality of computer-generated machine learning models. In some examples, the query generator 208 triggers the training of the computer-generated machine learning models to occur on a computer, server, device, etc. In some examples, the query generator 208 triggers the training of the computer-generated machine learning models by sending a command, instruction, network communication, etc. to the computer, server, device, etc. In some examples, the query generator 208 triggers the training of the computer-generated machine learning models in parallel.

At block 410, the example query generator 208 determines if the models have finished training. If at block 410 the example query generator 208 determines the models have not finished training, the instructions 400 remain at block 410 and wait for the example query generator 208 to determine the models have finished training. If at block 410 the example query generator 208 determines the models have finished training, the instructions 400 continue to block 412 at which the example analytics controller 210 (FIG. 2) generates performance results for all of the models. In some examples, the example analytics controller 210 (FIG. 2) generates performance results for all of the computer-generated machine learning models from the example query generator 208. The example analytics controller 210 compares the results (e.g., demographic information for users, etc.) from the training of the plurality of computer-generated machine learning models to the demographic data from audience measurement panelists from the AME panel data. In some examples, the demographic data from audience measurement panelists from the AME panel data is used to validate (e.g., the AME panel data serves as the truth data) the demographic results of the computer-generated machine learning models. In some examples, the analytics controller 210 obtains demographic data of the audience measurement panelists who accessed media via panelist client devices and who correspond to the users of the database proprietor 102 to determine the performance of each of the computer-generated machine learning models. In some examples, the performance results include model accuracy, demographic accuracy, etc.

At block 414, the example analytics controller 210 stores the performance results of the models. In some examples, the example analytics controller 210 stores the computer-generated machine learning models and the corresponding performance results in the example demographic correction models database 142 of FIG. 1. After the example analytics controller 210 stores the performance results of the models, the instructions 400 of FIG. 4 end.

FIG. 5 is a flowchart representative of machine readable instructions 500 which may be executed to implement the example model analyzer 144 of FIGS. 1 and/or 3. In some examples, the instructions 500 are executed by the same computer or machine that executes the instructions 400. In other examples, the instructions 500 are executed by a separate computer or machine than a computer or machine that executes the instructions 400. In this manner, the instructions 400 and the instructions 500 can be flexibly executed by the same computer/machine or by separate computers/machines. The example instructions 500 begin at block 502 at which the example query results interface 302 (FIG. 3) obtains the results of the models. In some examples, the example query results interface 302 obtains the computer-generated machine learning models and corresponding performance results from the example analytics controller 210 of FIG. 2. In some examples, the example query results interface 302 obtains the computer-generated machine learning models and corresponding performance results stored in the example demographic correction models database 142 of FIG. 1.

At block 504, the example data aggregation controller 304 (FIG. 3) runs a query to merge individual results to generate aggregate results. In some examples, the example data aggregation controller 304 generates aggregate results of the performance results from the plurality of computer-generated machine learning models. In some examples, the data aggregation controller 304 aggregates the performance results into a tabular format that is easier to read and interpret when analyzing the performance results of the computer-generated machine learning models.

At block 506, the example model selector 306 (FIG. 3) compares the model output performance results. In some examples, the example model selector 306 compares the performance results of the computer-generated machine learning models based on the aggregated performance results from the example data aggregation controller 304. At block 508, the example model selector 306 selects a model based on the performance results. In some examples, the example model selector 306 compares the performance results across the computer-generated machine learning models to select a computer-generated machine learning model with the best performance relative to other ones of the computer-generated machine learning models. In some examples, the example model selector 306 determines the computer-generated machine learning model with the relatively best performance based on a combination of the performance metrics determined by the example analytics controller 210 of FIG. 2 (e.g., model accuracy, demographic accuracy, etc.). The example model selector 306 determines which combination of features and hyperparameters yielded the best performance results among the computer-generated machine learning models.

At block 510, the example model selector 306 applies the selected model to correct error(s). In some examples, the example model selector 306 applies the selected computer-generated machine learning model to the impressions from the database proprietor 102 to correct for any human errors (e.g., errors from users self-declaring inaccurate demographic information such as age, gender, etc.) and/or any computer-generated errors (e.g., misattribution errors, non-coverage errors, etc.) in the user demographic information. In some examples, the example model selector 306 uses the selected computer-generated machine learning model to analyze the impressions in the example enriched impressions database 120 of FIG. 1 that were matched to a particular user of the database proprietor 102. The demographic information (e.g., age) resulting from applying the selected computer-generated machine learning model for each user may be stored in the example model inferences database 146 of FIG. 1 for subsequent use, retrieval, and/or analysis. In some examples, the example model selector 306 uses the selected computer-generated machine learning model to analyze the entire user base of the database proprietor 102 regardless of whether the users are matched to any particular media impressions. After inferring the correct demographic (e.g., age) for each user using the selected computer-generated machine learning model, the example model selector 306 stores the inferences in the example model inferences database 146 of FIG. 1. In some such examples, when the users matched to particular impressions are to be analyzed (e.g., the users matched to impressions in the enriched impressions database 120), the example model selector 306 extracts the inferred demographic assignment for each relevant user in the enriched impressions database 120 that matches with one or more media impressions. After the example model selector 306 applies the selected model to correct error(s), the instructions 500 of FIG. 5 end.

In some examples, block 510 (e.g., applying the selected model to correct error(s)) can be performed by the same computer that selects the computer-generated machine learning model (as illustrated in FIG. 5). However, in other examples, a separate computer may apply the selected computer-generated machine learning model to correct for the computer-generated errors. For example, one computer may generate the machine learning models and analyze the performance to select one of the machine learning models, and a separate computer may apply the selected machine learning model to correct the demographic information that may have computer-generated errors.

FIG. 6 is a block diagram of an example processor platform 600 structured to execute the instructions of FIGS. 4 and/or 5 to implement the model generator 140 and the example model analyzer 144 of FIGS. 1-3. The processor platform 600 can be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad), a personal digital assistant (PDA), an Internet appliance, a set top box, a headset or other wearable device, or any other type of computing device.

The processor platform 600 of the illustrated example includes a processor 612. The processor 612 of the illustrated example is hardware. For example, the processor 612 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer. The hardware processor may be a semiconductor based (e.g., silicon based) device. In this example, the processor 612 implements the example feature interface 202, the example hyperparameter interface 204, the example query selector 206, the example query generator 208, the example analytics controller 210, the example query results interface 302, the example data aggregation controller 304, and the example model selector 306.

The processor 612 of the illustrated example includes a local memory 613 (e.g., a cache). The processor 612 of the illustrated example is in communication with a main memory including a volatile memory 614 and a non-volatile memory 616 via a bus 618. The volatile memory 614 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®) and/or any other type of random access memory device. The non-volatile memory 616 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 614, 616 is controlled by a memory controller.

The processor platform 600 of the illustrated example also includes an interface circuit 620. The interface circuit 620 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), a Bluetooth® interface, a near field communication (NFC) interface, and/or a PCI express interface.

In the illustrated example, one or more input devices 622 are connected to the interface circuit 620. The input device(s) 622 permit(s) a user to enter data and/or commands into the processor 612. The input device(s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, isopoint and/or a voice recognition system.

One or more output devices 624 are also connected to the interface circuit 620 of the illustrated example. The output devices 624 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube (CRT) display, an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer and/or speaker. The interface circuit 620 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip and/or a graphics driver processor.

The interface circuit 620 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) via a network 626. The communication can be via, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, etc.

The processor platform 600 of the illustrated example also includes one or more mass storage devices 628 for storing software and/or data. Examples of such mass storage devices 628 include floppy disk drives, hard drive disks, compact disk drives, Blu-ray disk drives, redundant array of independent disks (RAID) systems, and digital versatile disk (DVD) drives.

Machine executable instructions 632 represented in FIGS. 4 and 5 may be stored in the mass storage device 628, in the volatile memory 614, in the non-volatile memory 616, and/or on a removable non-transitory computer readable storage medium such as a CD or DVD.

From the foregoing, it will be appreciated that example methods, apparatus and articles of manufacture have been disclosed that generate computer-trained machine learning models to correct computer-generated errors in audience data. The disclosed methods, apparatus and articles of manufacture generate a plurality of computer-trained machine learning models with different combinations of features and hyperparameters to determine the best performing machine learning model relative to other ones of the machine learning models to correct computer-generated errors in audience data. The disclosed methods, apparatus and articles of manufacture improve the efficiency of using a computing device by running the plurality of computer-trained machine learning models in parallel. In some examples, manually running a single machine learning model (e.g., one feature and hyperparameter combination) would take approximately ten minutes. The disclosed methods, apparatus and articles of manufacture are able to generate and run upward of 100 different machine learning models in ten minutes. The disclosed methods, apparatus and articles of manufacture are accordingly directed to one or more improvement(s) in the functioning of a computer.

Example methods, apparatus, systems, and articles of manufacture to generate computer-trained machine learning models to correct computer-generated errors in audience data are disclosed herein. Further examples and combinations thereof include the following:

Example 1 includes an apparatus comprising a query selector to select aplurality of features and a range of hyperparameters, a query generatorto generate a plurality of machine learning models based on theplurality of features and the range of hyperparameters, and initiatetraining of the plurality of machine learning models based ondemographic data in a privacy-protected cloud environment, thedemographic data obtained from database proprietor user accountscorresponding to audience measurement panelists, and a model selector toselect a first machine learning model from the plurality of machinelearning models.

Example 2 includes the apparatus of example 1, wherein the querygenerator is to initiate the training of the plurality of machinelearning models in parallel.

Example 3 includes the apparatus of example 1, further including ananalytics controller to generate performance results for the pluralityof machine learning models.

Example 4 includes the apparatus of example 3, wherein the analyticscontroller is to compare results from training the plurality of machinelearning models to at least some of the demographic data of ones of theaudience measurement panelists who access media via panelist clientdevices, and generate the performance results based on the comparison.

Example 5 includes the apparatus of example 4, wherein the performanceresults include at least one of model accuracy or demographic accuracy.

Example 6 includes the apparatus of example 4, further including a dataaggregation controller to aggregate the performance results of theplurality of machine learning models.

Example 7 includes the apparatus of example 6, wherein the modelselector is to select the first machine learning model from theplurality of machine learning models based on the aggregated performanceresults.

Example 8 includes the apparatus of example 1, wherein the privacy-protected cloud environment includes first data from at least one of media providers or third parties combined with second data from a database proprietor in a data store, the second data including the demographic data.

Example 9 includes a non-transitory computer readable storage medium comprising instructions that, when executed, cause at least one processor to select a plurality of features and a range of hyperparameters, generate a plurality of machine learning models based on the plurality of features and the range of hyperparameters, initiate training of the plurality of machine learning models based on demographic data in a privacy-protected cloud environment, the demographic data obtained from database proprietor user accounts corresponding to audience measurement panelists, and select a first machine learning model from the plurality of machine learning models.

Example 10 includes the non-transitory computer readable storage medium of example 9, wherein the instructions, when executed, cause the at least one processor to initiate the training of the plurality of machine learning models in parallel.

Example 11 includes the non-transitory computer readable storage medium of example 9, wherein the instructions, when executed, cause the at least one processor to generate performance results for the plurality of machine learning models.

Example 12 includes the non-transitory computer readable storage medium of example 11, wherein the instructions, when executed, cause the at least one processor to compare results from training the plurality of machine learning models to at least some of the demographic data of ones of the audience measurement panelists who access media via panelist client devices, and generate the performance results based on the comparison.

Example 13 includes the non-transitory computer readable storage medium of example 12, wherein the performance results include at least one of model accuracy or demographic accuracy.

Example 14 includes the non-transitory computer readable storage medium of example 12, wherein the instructions, when executed, cause the at least one processor to aggregate the performance results of the plurality of machine learning models.

Example 15 includes the non-transitory computer readable storage medium of example 14, wherein the instructions, when executed, cause the at least one processor to select the first machine learning model from the plurality of machine learning models based on the aggregated performance results.

Example 16 includes the non-transitory computer readable storage medium of example 9, wherein the privacy-protected cloud environment includes first data from at least one of media providers or third parties combined with second data from a database proprietor in a data store, the second data including the demographic data.

Example 17 includes a method comprising selecting a plurality of features and a range of hyperparameters, generating a plurality of machine learning models based on the plurality of features and the range of hyperparameters, initiating training of the plurality of machine learning models based on demographic data in a privacy-protected cloud environment, the demographic data obtained from database proprietor user accounts corresponding to audience measurement panelists, and selecting a first machine learning model from the plurality of machine learning models.

Example 18 includes the method of example 17, further including initiating the training of the plurality of machine learning models in parallel.

Example 19 includes the method of example 17, further including generating performance results for the plurality of machine learning models.

Example 20 includes the method of example 19, further including comparing results from training the plurality of machine learning models to at least some of the demographic data of ones of the audience measurement panelists who access media via panelist client devices, and generating the performance results based on the comparison.

Example 21 includes the method of example 20, wherein the performance results include at least one of model accuracy or demographic accuracy.

Example 22 includes the method of example 20, further including aggregating the performance results of the plurality of machine learning models.

Example 23 includes the method of example 22, further including selecting the first machine learning model from the plurality of machine learning models based on the aggregated performance results.

Example 24 includes the method of example 17, wherein the privacy-protected cloud environment includes first data from at least one of media providers or third parties combined with second data from a database proprietor in a data store, the second data including the demographic data.

Example 25 includes an apparatus comprising memory, and at least one processor to execute computer readable instructions to at least select a plurality of features and a range of hyperparameters, generate a plurality of machine learning models based on the plurality of features and the range of hyperparameters, initiate training of the plurality of machine learning models based on demographic data in a privacy-protected cloud environment, the demographic data obtained from database proprietor user accounts corresponding to audience measurement panelists, and select a first machine learning model from the plurality of machine learning models.

Example 26 includes the apparatus of example 25, wherein the at least one processor is to execute the computer readable instructions to initiate the training of the plurality of machine learning models in parallel.

Example 27 includes the apparatus of example 25, wherein the at least one processor is to execute the computer readable instructions to generate performance results for the plurality of machine learning models.

Example 28 includes the apparatus of example 27, wherein the at least one processor is to execute the computer readable instructions to compare results from training the plurality of machine learning models to at least some of the demographic data of ones of the audience measurement panelists who access media via panelist client devices, and generate the performance results based on the comparison.

Example 29 includes the apparatus of example 28, wherein the performance results include at least one of model accuracy or demographic accuracy.

Example 30 includes the apparatus of example 28, wherein the at least one processor is to execute the computer readable instructions to aggregate the performance results of the plurality of machine learning models.

Example 31 includes the apparatus of example 30, wherein the at least one processor is to execute the computer readable instructions to select the first machine learning model from the plurality of machine learning models based on the aggregated performance results.

Example 32 includes the apparatus of example 25, wherein the privacy-protected cloud environment includes first data from at least one of media providers or third parties combined with second data from a database proprietor in a data store, the second data including the demographic data.
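For illustration only, the following sketch (continuing the hypothetical Python example above) shows one way the performance results described in examples 4-7 could be computed against held-out panelist demographics, aggregated, and used to select the best-performing candidate. The PerformanceResult fields, the blended scoring weight, and the toy values are assumptions, not part of the disclosure.

```python
# Hypothetical sketch: score candidate models against held-out panelist
# demographics, aggregate the results, and select the best candidate.
from dataclasses import dataclass


@dataclass
class PerformanceResult:
    model_id: str
    model_accuracy: float        # share of panelist records predicted correctly
    demographic_accuracy: float  # overlap of predicted vs. actual demographic mix


def model_accuracy(predicted, actual):
    """Fraction of individual panelist demographic labels predicted correctly."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


def demographic_accuracy(predicted, actual):
    """Overlap between the predicted and actual demographic distributions."""
    overlap = 0.0
    for category in set(actual):
        predicted_share = predicted.count(category) / len(predicted)
        actual_share = actual.count(category) / len(actual)
        overlap += min(predicted_share, actual_share)
    return overlap


def select_best(results, weight=0.5):
    """Select the candidate with the best blended score across both metrics."""
    return max(
        results,
        key=lambda r: weight * r.model_accuracy + (1 - weight) * r.demographic_accuracy,
    )


if __name__ == "__main__":
    # Toy aggregated performance results for three hypothetical candidates,
    # as would be produced by the two accuracy functions above.
    aggregated = [
        PerformanceResult("model_a", model_accuracy=0.71, demographic_accuracy=0.88),
        PerformanceResult("model_b", model_accuracy=0.74, demographic_accuracy=0.90),
        PerformanceResult("model_c", model_accuracy=0.69, demographic_accuracy=0.93),
    ]
    print(f"Selected {select_best(aggregated).model_id}")  # model_b under equal weights
```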

Although certain example methods, apparatus and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus and articles of manufacture fairly falling within the scope of the claims of this patent.

1. An apparatus comprising: a query selector to select a plurality of features and a range of hyperparameters; a query generator to: generate a plurality of machine learning models based on the plurality of features and the range of hyperparameters; and initiate training of the plurality of machine learning models based on demographic data in a privacy-protected cloud environment, the demographic data obtained from database proprietor user accounts corresponding to audience measurement panelists; and a model selector to select a first machine learning model from the plurality of machine learning models.
2. The apparatus of claim 1, wherein the query generator is to initiate the training of the plurality of machine learning models in parallel.
3. The apparatus of claim 1, further including an analytics controller to generate performance results for the plurality of machine learning models.
4. The apparatus of claim 3, wherein the analytics controller is to: compare results from training the plurality of machine learning models to at least some of the demographic data of ones of the audience measurement panelists who access media via panelist client devices; and generate the performance results based on the comparison.
5. The apparatus of claim 4, wherein the performance results include at least one of model accuracy or demographic accuracy.
6. The apparatus of claim 4, further including a data aggregation controller to aggregate the performance results of the plurality of machine learning models.
7. The apparatus of claim 6, wherein the model selector is to select the first machine learning model from the plurality of machine learning models based on the aggregated performance results.
8. The apparatus of claim 1, wherein the privacy-protected cloud environment includes first data from at least one of media providers or third parties combined with second data from a database proprietor in a data store, the second data including the demographic data.
9. A non-transitory computer readable storage medium comprising instructions that, when executed, cause at least one processor to: select a plurality of features and a range of hyperparameters; generate a plurality of machine learning models based on the plurality of features and the range of hyperparameters; initiate training of the plurality of machine learning models based on demographic data in a privacy-protected cloud environment, the demographic data obtained from database proprietor user accounts corresponding to audience measurement panelists; and select a first machine learning model from the plurality of machine learning models.
10. The non-transitory computer readable storage medium of claim 9, wherein the instructions, when executed, cause the at least one processor to initiate the training of the plurality of machine learning models in parallel.
11. The non-transitory computer readable storage medium of claim 9, wherein the instructions, when executed, cause the at least one processor to generate performance results for the plurality of machine learning models.
12. The non-transitory computer readable storage medium of claim 11, wherein the instructions, when executed, cause the at least one processor to: compare results from training the plurality of machine learning models to at least some of the demographic data of ones of the audience measurement panelists who access media via panelist client devices; and generate the performance results based on the comparison.
 13. (canceled)
14. The non-transitory computer readable storage medium of claim 12, wherein the instructions, when executed, cause the at least one processor to aggregate the performance results of the plurality of machine learning models.
15. The non-transitory computer readable storage medium of claim 14, wherein the instructions, when executed, cause the at least one processor to select the first machine learning model from the plurality of machine learning models based on the aggregated performance results.
16-24. (canceled)
25. An apparatus comprising: memory; and at least one processor to execute computer readable instructions to at least: select a plurality of features and a range of hyperparameters; generate a plurality of machine learning models based on the plurality of features and the range of hyperparameters; initiate training of the plurality of machine learning models based on demographic data in a privacy-protected cloud environment, the demographic data obtained from database proprietor user accounts corresponding to audience measurement panelists; and select a first machine learning model from the plurality of machine learning models.
26. The apparatus of claim 25, wherein the at least one processor is to execute the computer readable instructions to initiate the training of the plurality of machine learning models in parallel.
27. The apparatus of claim 25, wherein the at least one processor is to execute the computer readable instructions to generate performance results for the plurality of machine learning models.
28. The apparatus of claim 27, wherein the at least one processor is to execute the computer readable instructions to: compare results from training the plurality of machine learning models to at least some of the demographic data of ones of the audience measurement panelists who access media via panelist client devices; and generate the performance results based on the comparison.
 29. (canceled)
30. The apparatus of claim 28, wherein the at least one processor is to execute the computer readable instructions to aggregate the performance results of the plurality of machine learning models.
31. The apparatus of claim 30, wherein the at least one processor is to execute the computer readable instructions to select the first machine learning model from the plurality of machine learning models based on the aggregated performance results.
32. (canceled)